Foreword

The original objectives of this project were to develop the technologies and design automation
environment for high clock-rate MCM-packaged gallium arsenide circuits which used flip-chip array I/O
interconnect, and to demonstrate these technologies in a prototype microprocessor. CAD tools were to
be developed to support optimization of such systems for performance, power and cost.

The project involved close collaboration with two subcontractors, Motorola Semiconductor for the
complementary gallium arsenide (CGaAs) technology, and Cascade Design Automation for CAD tools.
Motorola scaled the CGaAs process from 0.7 µm minimum dimensions to 0.5 µm, improved the CGaAs
yield significantly, conducted experiments in reducing threshold voltages, and fabricated prototype circuits.
In the early days of the project, Motorola's Space and Systems Technology Group was a regular
subcontractor, but when the budget was reduced, they agreed to continue to collaborate with us, bearing all of the
costs themselves; most of the work was done under this arrangement. Cascade provided the physical
design tools and support for these tools throughout the project. In addition, they developed a special tool
for placing arrayed I/O pads on chips for flip-chip assembly. At the University of Michigan, a PowerPC
architectural simulator was developed to evaluate cycles per instruction for various microarchitectures, a
CGaAs cell library was developed, and all of the design and testing of circuits was done.

Early in the project, the goal was to design the prototype system to operate with a 1 GHz clock,
and advances were made in complementary GaAs technology, circuits, and packaging to enable this.
When Motorola decided that CGaAs was the technology for their proposed Celestri satellite system, they
fixed the minimum dimensions at 0.5 µm, rather than further scaling the process, and froze the thresholds
at +/- 0.55 V, rather than reducing them; on the positive side, they added a low-temperature GaAs buffer
layer under the devices, which improved subthreshold slope and made the transistors extremely radiation
hard to single-event upset. These decisions eliminated the possibility of building a fast processor in
CGaAs, so at this point, the focus of the prototype system was shifted to space applications, which could
take advantage of both the radiation hardness and the excellent power-delay product of CGaAs.

CGaAs technology was analyzed to determine the most cost-effective scaling factor for each
design rule; the methodology and tools developed for this can also be applied to the nonlinear scaling of
deep-submicron CMOS processes. Static, domino and dual-rail domino (CVSL) circuits were designed to
evaluate CGaAs for use in VLSI digital circuits. Phase-locked loop and current-mode I/O circuits were
designed and tested. To facilitate the design of high-performance integrated circuits, logic synthesis and
place-and-route tools were written. A gold-bumping process was developed in the UM Solid-State Electronics
Laboratory which produces bumps on pitches as tight as 50 µm. A superscalar PowerPC microarchitecture
was developed for implementation in CGaAs with its modest integration levels, and fabricated in
CMOS to prove correctness of the design. The project culminated in the design and testing of the
radiation-hard CGaAs PUMA PowerPC microprocessor, which incorporates an area-I/O array.

Statement of Problem

Microprocessors  have  had  a  profound  impact  on  both  the  scientific  world  and  on  our  personal 
lives.  Through  astonishing  advances  in  performance,  they  have  replaced  traditional  mainframes  and 
supercomputers  with  microprocessor-based  workstations  and  servers  [1].  The  remarkable  decrease  in 
cost  vs.  performance  for  microprocessors  has  made  computing  ubiquitous  in  our  society. 

Processor  trends  can  be  seen  by  surveying  the  microprocessors  presented  at  ISSCC.  Over  the 
ten  years  before  this  project  began,  gate  delays  improved  at  12%  per  year;  clock  frequencies  increased  at 
40%  per  year;  and  transistor-counts  grew  at  40%  per  year  [2].  The  performance  of  systems  made  from 
these  processors,  as  measured  by  the  integer  SPEC  benchmarks,  improved  at  a  compounded  rate  of  59% 
per  year,  resulting  in  an  increase  in  computing  power  of  more  than  100  fold  over  the  decade. 
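
The decade-long figures follow from simple compounding. A minimal Python sketch of that arithmetic is given here; the function name is illustrative, and reading the 12% gate-delay improvement as a yearly speed gain is an assumption rather than something stated in [2].

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Return the total growth factor after compounding annual_rate for years."""
    return (1.0 + annual_rate) ** years

YEARS = 10
# Annual rates quoted in the text.
spec_factor = compound_growth(0.59, YEARS)    # integer SPEC performance
clock_factor = compound_growth(0.40, YEARS)   # clock frequency
gate_factor = compound_growth(0.12, YEARS)    # gate speed (assumed interpretation)

print(f"SPEC performance: {spec_factor:.0f}x")
print(f"Clock frequency:  {clock_factor:.1f}x")
print(f"Gate speed:       {gate_factor:.1f}x")
```

Under these assumptions the 59% rate compounds to slightly more than a 100-fold gain over ten years, consistent with the figure quoted above.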

The disparity in improvement between gate delay and clock frequency was due to the fact that
some of the additional transistors made available during these years were used to pipeline processors,
reducing the number of gates between latches. However, with processors having on the order of ten
pipeline stages, additional pipeline depth provides diminishing performance returns, and could not be expected
to maintain the steep increase in clock frequency seen before. The growing transistor budget also
supported the addition of on-chip cache memory, which reduced load/store latency to memory. But again,
with large first- and second-level instruction and data caches on chip, the performance return for enlarging
cache memories beyond their current sizes was modest. To keep new processors on the performance
curve, architects also invested their additional transistors in multiple functional units for concurrent
instruction execution (superscalar architectures). Unfortunately, the benefits of parallelism also diminish with
scale in general-purpose machines.

With  pipelining,  on-chip  cache  size,  and  instruction  issue  width  all  at  their  points  of  diminishing 
return,  semiconductor  manufacturers  turned  to  technology  to  keep  the  growth  in  computer  performance  on 
the  curve.  Increased  attention  to  the  scaling  of  CMOS,  the  inclusion  of  more  fine-pitch  metal  interconnect 
layers,  and  more  aggressive  circuit  techniques  allowed  vendors  to  increase  the  clock  frequency  and 
thereby  increase  processor  throughput.  The  importance  of  semiconductor  technology  to  future  high-end 
computer  performance  warranted  the  evaluation  of  processes  such  as  Complementary  GaAs  and  SOI, 
which  were  outside  of  the  mainstream.  CGaAs  has  the  switching  speed  of  HEMTs  with  many  of  the  circuit 
advantages of CMOS. The low-voltage operation of CGaAs, combined with the good switching speed,
gives CGaAs an excellent power-delay product. Process changes during this project made it extremely
radiation  hard. 

When  this  project  began,  CGaAs  had  been  targeted  primarily  at  RF  applications,  with  little  digital 
work  having  been  done.  This  project  aimed  to  explore  CGaAs  technology  for  VLSI  digital  circuits,  and  to 
provide  the  packaging  technologies  needed  to  make  it  viable  in  such  applications. 

Summary of Results

This  section  of  the  report  includes  an  overview  of  CGaAs  technology,  a  list  of  accomplishments, 
and  conclusions  that  can  be  drawn  from  the  work  in  this  project. 


Overview  of  CGaAs  Technology 

CGaAs, a complementary heterostructure-insulated-gate FET technology, has been described in
some detail in [3-5]. A sketch of the device structure is shown in Fig. 1. CGaAs integrates an
enhancement-mode P-channel HFET with a high-performance N-channel HFET. Historically, the primary interest in
GaAs and other III/V materials has derived from their high electron mobilities. While holes in III/V materials
do not enjoy an intrinsic mobility advantage over those in silicon, the pseudomorphic P-channel HFETs in
this process have three to five times higher transconductance at given gate dimensions than their silicon
counterparts. As seen in Fig. 1, the CGaAs process uses epitaxially-grown wafers, the cost of which
includes both that of the initial GaAs wafer and of growing the additional layers. In moderate volumes, this
can be 20 to 25 times the cost of a silicon substrate. Though the wafers are smaller and more expensive
than silicon, the CGaAs process requires only 13 masks through three levels of interconnect,
compensating in part with process efficiency for the more costly starting material. CGaAs, however, has not enjoyed
the efficiencies of high-volume production as has CMOS, and completed wafers from a high-volume CMOS
process line cost approximately 40% to 50% as much as completed CGaAs wafers. Considering all of
these factors, the price of a finished complementary gallium arsenide die is approximately 4.8 times the
price of a similar-size, high-volume CMOS die.

Standard gate lengths were scaled from 0.7 to 0.5 µm, and experimental N-channel devices at
0.35 µm gate lengths performed well. CGaAs has a power-delay product of 0.01 mW/MHz/gate at 0.7-µm
minimum feature size. Recent improvements to the epi structure make the CGaAs process resistant to
single-event upset (fewer than 10^-9 upsets/bit-day for SCFL logic and 10^-10 upsets/bit-day for
complementary logic), as well as to large total dose radiation (more than 10^8 rads) and latchup (more than 10^12
rads). Typical parameters for 0.7-µm channel-length devices having +/- 0.55 V thresholds (measured with
Vdd = 1.5 V) are given in Table 1. As seen in the table, both N and P channel devices have good output
conductances and pinch-off characteristics. The original device thresholds of +/- 0.55 V were selected
because they yielded the optimum power-delay product in complementary circuits; they produce a
drain-current ratio between N and P devices of about 4:1.

Fig. 1: CGaAs process cross-section.

Table 1: CGaAs Device Parameters.

Parameter                        NFET (0.7 x 10 µm)   PFET (0.7 x 10 µm)
Vth (V)                          +0.55                -0.55
Idss (mA)                        1.8                  0.5
gm (mS/mm)                       280                  60
Beta (mA/V^2-mm)                 270                  50
Subth slope (mV/dec)             75                   90
Subth current (nA) (Vgs = 0 V)   <1                   <10

The low threshold voltages and high transconductance of CGaAs result in good performance at
low voltages. Fig. 2 shows unloaded ring oscillator delays versus supply voltage for several logic families
(thresholds of +/- 0.55 V). The delay of 1.0-µm CGaAs is less than that of 0.5-µm CMOS or Thin-Film
Silicon-on-Insulator (TFSOI), and the 0.5-µm CGaAs shows delays below 100 ps with a 1.2 V power supply.
Power dissipation is not indicated in the figure. Lower threshold voltages will make the CGaAs circuits
faster yet.

Two key parameters of concern in complementary heterostructure FET devices are gate leakage
and sub-threshold drain-source leakage [3, 6], which determine the stand-by power dissipation of
complementary circuits. Unlike Si CMOS, which has an SiO2 gate insulator, the CGaAs gate is a Schottky diode
to AlGaAs. Substantial gate current flows for gate voltages in excess of about one volt. Gate leakage
current depends on the Schottky barrier height and band offsets. The large valence band offset (about 0.55
V) of high mole-fraction AlGaAs, as used in these devices, improves the gate leakage of PFETs. Typically,
the PFET gate-diode turn-on voltage, defined as the gate voltage resulting in 1 µA/µm^2 gate area at Vds =
0, is -2 V. NFETs have a turn-on voltage of 1.75 V. The gate turn-on voltages are also influenced by
implant straggle effects. Drain-induced barrier lowering increases gate current when the drain-to-source
voltage is high (which occurs when a logic input changes state).

Fig. 2: Propagation delay of CGaAs, CMOS, and TFSOI versus
supply voltage. Gates are driving one load.

Accomplishments

The  power  efficiency  and  radiation  hardness  of  CGaAs  make  it  attractive  for  space  and  satellite 
applications.  However,  CGaAs  presents  design  challenges  such  as  reduced  power-supply  voltage  (little 
headroom  in  circuits),  proportionately  large  threshold  voltages  (lower  speed  than  could  otherwise  be 
achieved  by  the  HEMTs),  gate  and  subthreshold  drain-source  leakage  (higher  power),  and  low  integration 
levels  (restrictions  on  architectures).  CGaAs  technology  has  been  studied  in  this  project  for  implementing 
large  VLSI  circuits  such  as  microprocessors,  in  light  of  these  challenges. 

The Motorola 0.5-µm CGaAs process, which had been developed by shrinking the gate length of a
0.7-µm process, was the primary technology employed in this project. It was clear from the beginning that
the design rules were not optimal and that the process needed to be scaled. A considerable amount of
process development work was done on CGaAs to shift the thresholds, before the decision was made that
they would be fixed at +/- 0.55 V in order to assure that circuits could be delivered on schedule for the
Celestri program. The yield was significantly improved on the CGaAs process through a defectivity
program, and subthreshold leakage was reduced by adding a low-temperature GaAs buffer layer. A negative
impact of this layer was that it reduced transistor gain, but because it is characterized by a very short
carrier lifetime, it provides single-event upset protection for the process. Because the process has neither
an SiO2 gate oxide nor device isolation, it has always been intrinsically hard to total radiation dose effects.

As CMOS processes are shrunk below 0.18 µm, the linear scaling of some design rules will be
very difficult, so non-linear scaling will be needed for CMOS in the near future. Working with Motorola
process engineers, we evaluated CGaAs for scaling. In doing so, we developed a general (works for any
technology) methodology for quantitatively evaluating semiconductor processes for optimal scaling. The
methodology includes identifying the design rules which have the greatest impact on the scaling objective
and analyzing the area, power and performance improvements as these rules are incrementally scaled.
The improvement data is combined with die cost estimates to produce a cost/benefit ratio which can guide
scaling decisions. The methodology is based on the automated analysis of embedded static RAMs
generated by a process-independent, optimizing SRAM compiler developed as part of this project. A
cost/benefit analysis of the CGaAs design rules shows that when operating under a fixed spending cap, this
nonlinear scaling approach can provide greater improvements in area and performance than linear scaling.
The analysis results for the 0.5-µm CGaAs process recommend that threshold voltages be reduced, and
that the first of a number of recommended scaling steps should be a 30% reduction of the source/drain
area and via/metal pitch.
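
The idea of ranking scaling steps by cost/benefit ratio under a spending cap can be sketched as follows. This is a simplified greedy illustration, not the project's actual tool; the `ScalingStep` type, the `select_steps` function, the rule names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingStep:
    rule: str        # design rule being scaled (names are illustrative)
    benefit: float   # estimated improvement, e.g. percent area reduction (> 0)
    cost: float      # estimated added die cost, arbitrary units

def select_steps(steps, budget):
    """Greedily pick steps with the lowest cost per unit benefit within budget."""
    chosen, spent = [], 0.0
    for step in sorted(steps, key=lambda s: s.cost / s.benefit):
        if spent + step.cost <= budget:
            chosen.append(step)
            spent += step.cost
    return chosen, spent

# Hypothetical numbers, for illustration only; not measured data.
candidates = [
    ScalingStep("source/drain area", benefit=12.0, cost=3.0),
    ScalingStep("via/metal pitch", benefit=10.0, cost=4.0),
    ScalingStep("gate length", benefit=6.0, cost=6.0),
    ScalingStep("contact spacing", benefit=2.0, cost=1.5),
]

chosen, spent = select_steps(candidates, budget=9.0)
for step in chosen:
    print(f"{step.rule}: cost/benefit = {step.cost / step.benefit:.2f}")
print(f"total cost = {spent}")
```

A greedy ranking is only one possible way to use the ratio; in practice the benefit of one step may depend on which other rules have already been scaled, which this sketch ignores.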

Full complementary, unipolar (pseudo direct-coupled FET logic), pass-gate logic, and domino logic
styles were evaluated in the complementary GaAs technology. A logic-evaluation test chip was fabricated
at Motorola. Because initial evaluations of dynamic logic yielded promising results, a PowerPC ALU was
designed in domino logic. While this circuit was in fabrication, a yield-compromising design rule problem
was identified; it became necessary to break gate-metal runs between n- and p-transistors to avoid
leakage paths. This test run did provide valuable experience with the various logic styles, but because of the
design-rule problem, the dynamic ALU did not yield. An environment to help circuit designers optimize
transistor sizes in SPICE netlists over power, area and delay was developed as part of this effort. Changes
in focus at Motorola during this time caused the project to shift from high-clock-rate designs to
radiation-hard applications.

The original plan included implementation of a virtual memory system, and a software-managed
in-cache translation mechanism for the processor was developed. This is an extremely low-overhead
memory management scheme which provides all the benefits of traditional schemes but removes a substantial
amount of hardware from the critical path, enabling much faster clock speeds. When the project budget
was reduced with a reorganization at DARPA, we dropped the virtual memory and floating-point unit.

A trace-driven architectural simulator was developed to guide the design. To verify functionality of
the PUMA design, we developed a random instruction generator which produces code based on a
user-specified maximum number of loops and branches, and on flags specifying whether to use unimplemented
instructions and misaligned memory accesses. Certain classes of instructions can be exercised, and
register usage can be limited in order to stress forwarding interlocks. Simulation results with this code running
on the Verilog PUMA model and on a PowerPC architectural simulator are compared to verify proper
functionality.
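
A constrained random generator of this kind, together with the trace comparison, might look like the following sketch. It is an illustration only: the function names, the mnemonic subset, and the options shown are assumptions, and the actual generator's interface is not described in this report.

```python
import random

# Illustrative subset of PowerPC-style mnemonics; the real generator's
# instruction classes and options are not specified in this report.
ALU_OPS = ["add", "subf", "and", "or", "xor"]
MEM_OPS = ["lwz", "stw"]

def generate_program(length, max_branches, num_regs, seed=0):
    """Return assembly-like lines containing at most max_branches branches.

    Restricting num_regs forces register reuse, which stresses forwarding.
    """
    rng = random.Random(seed)
    regs = [f"r{i}" for i in range(num_regs)]
    lines, branches = [], 0
    for i in range(length):
        kind = rng.choice(["alu", "alu", "mem", "branch"])
        if kind == "branch" and branches < max_branches and i + 1 < length:
            # Forward branch to the label of a later instruction.
            target = rng.randrange(i + 1, length)
            lines.append(f"L{i}: b L{target}")
            branches += 1
        elif kind == "mem":
            op = rng.choice(MEM_OPS)
            offset = 4 * rng.randrange(0, 16)   # word-aligned offsets only
            lines.append(f"L{i}: {op} {rng.choice(regs)}, {offset}({rng.choice(regs)})")
        else:
            op = rng.choice(ALU_OPS)
            lines.append(f"L{i}: {op} {rng.choice(regs)}, {rng.choice(regs)}, {rng.choice(regs)}")
    return lines

def first_mismatch(trace_a, trace_b):
    """Return the index of the first differing entry, or None if traces agree."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return i
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None

program = generate_program(length=12, max_branches=2, num_regs=4, seed=1)
print("\n".join(program))
```

In such a flow, the same program would be run on both models and `first_mismatch` applied to their architectural state traces; a fixed seed makes any failing case reproducible.
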

Architectural methods of enhancing processor performance within the constraint of limited on-chip
cache were explored. A method of prefetching called 'runahead' allows the processor to execute
instructions under a cache miss, exposing other loads and stores that might have also generated cache misses,
so that these can be prefetched. A second approach we have evaluated scans the instruction stream for
branches as the instruction cache is loaded, and uses branch-prediction information to prefetch further
instructions.

Development of the PUMA processor architecture was driven by the limited CGaAs integration
level. The processor is implemented with a small on-chip primary instruction cache and a larger off-chip
primary data cache. The instruction fetch mechanism is guided by an efficient two-level dynamic branch
predictor and branch target buffer. Computation is performed by a small superscalar execution core
composed of branch, arithmetic, and load/store units. Based on trace-driven simulations of standard
benchmark programs, the architecture should achieve 0.77 instructions per cycle. Out-of-order execution is
supported by dedicated reservation stations for each functional unit and an eight-entry reorder buffer. The
decode process translates complex PowerPC instructions into one or more simple RISC operations. A
0.35 µm CMOS version of the architecture was first prototyped. It has 280 pins, measures 9.9 x 9.9 mm,
and contains 830K transistors. The chip was packaged in a 391-pin ceramic PGA. The chip is not fully
tested yet, but so far, no errors have been detected.
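
For readers unfamiliar with two-level dynamic branch prediction, the sketch below shows a generic textbook form: a global history register selects one of a table of 2-bit saturating counters. The organization and sizes are assumptions for illustration; the report does not specify the PUMA predictor's actual structure.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor with 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                             # first level: recent outcomes
        self.counters = [1] * (1 << history_bits)    # second level: weakly not-taken

    def predict(self):
        """Predict taken when the selected counter is in a 'taken' state (2 or 3)."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# Usage: a branch that alternates taken / not-taken is hard for a single
# 2-bit counter but becomes predictable once the history captures the pattern.
predictor = TwoLevelPredictor(history_bits=4)
outcomes = [i % 2 == 0 for i in range(200)]
correct = 0
for taken in outcomes:
    correct += predictor.predict() == taken
    predictor.update(taken)
print(f"accuracy: {correct / len(outcomes):.2f}")
```

After a short warm-up, each history pattern maps to its own counter, so the alternating branch is predicted correctly on nearly every iteration.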

The project culminated in the design and testing of the radiation-hard CGaAs PUMA PowerPC
microprocessor, shown in Fig. 3, which incorporates an area-I/O array. This CGaAs version was further
simplified to meet an integration limit of 400,000 transistors: the data cache was moved off chip,
out-of-order execution was eliminated, and the architecture was modified to be single-issue. The CGaAs chips
were fabricated at Motorola and ten of these chips were assembled (using just the peripheral I/O) in
conventional PGA packages for initial testing. All ten packaged microprocessors passed basic
tests, but only two sequenced and executed instructions properly.


Fig. 3: CGaAs PowerPC.

The two chips that passed had varying degrees of success with more advanced tests. None of the
devices passed all the tests completely. Immediate instructions worked and program address sequencing
worked, but instructions that manipulate register data gave bad data out. Functionality of the ALU,
load/store unit, and the branch unit can be inferred from these tests; however, since output data is often bad, it
is not known whether errors are introduced by the registers or the buses. The branch instructions did work
successfully. Using branches, the critical path of the FXU could be tested. The FXU operates at maximum
frequencies of 42 MHz on chip one and 33 MHz on chip two. This test exercises only the critical path in the
branch unit with certainty. There is not much difference in power dissipation between operating
frequencies, indicating that most of the power is dissipated as static power. Eighteen percent of the power is
dissipated in the core, the remainder in the pads. At a nominal operating voltage of 1.3 V, the FXU can be run
optimally at 20 MHz dissipating 274 mW.

None of the devices passed the instruction cache tests, indicating non-functional caches. More
detailed cache testing was performed on a separate 2 Kbyte SRAM chip, which used the same SRAM design
as the FXU caches. These chips also failed. The data out always followed the data in, indicating that the
decoder was not working correctly. The decoder uses DCFL NOR gates. The ratios of these gates were
not sufficient to provide a sufficiently low output-low voltage over process corners. Process data showed that
the beta values, drive currents, and leakage currents of the N and P transistors, as well as the threshold
voltage of the P devices, had a much wider distribution than anticipated. The degradation of the P device
indicated by process data could also explain the other results from the testing. Leakage currents would be
higher and some gates may not turn off at all, adding to static power and data errors. Further testing of the
scan path and circuit simulations with the measured process corners should help identify the exact
problems. Unfortunately, with the collapse of the Celestri project, Motorola is no longer running the CGaAs
process, so there is no opportunity to modify the circuits for another run, and no chance of getting tighter
process parameter control.

The PUMA project has developed new packaging and I/O signalling capabilities which are
appropriate for military and aerospace applications now, as well as for future commercial CMOS systems. The
processor chip includes a 315-pin area I/O pad array with a pad pitch of 6 mils, in addition to 288 pads in a
staggered peripheral ring. It is designed for flip-chip assembly using gold bumps on a fine-pitch MCM-L
board, connecting it to level-1 data cache, a memory management unit, PCI interface, and unified level-2
cache. The gold bumping process, which makes precisely-sized bumps of the desired aspect ratios, was
developed in the University of Michigan Solid-State Electronics Laboratory. A multichip module, fabricated
by MicroModule Systems, is a test vehicle for exploring design issues such as flip-chip area array
attachment for more than 1,000 pads, minimum feasible pad pitch, and pad yields for various pad pitches (50, 75,
100, 125, and 175 µm pitch).

The PUMA project has also contributed to high-performance signalling technology. CGaAs
Gunning transceiver, differential voltage, and switched current I/O interfaces have been designed, fabricated
and tested. Test results indicate that these circuits in CGaAs can support bit rates of at least 650 Mb/s/pin
(limited by the test set-up). An advanced transceiver based on switched-current techniques has also been
designed. The receiver actively terminates the input line to its characteristic impedance using an active
current mirror. The transmitted current pulse is 1.5 mA. The receiver is biased using a feedback circuit that
overcomes parametric variations between the transmitting and receiving chips; it compensates for
processing variations by adjusting the bias levels of the receiving chip. Simulations indicate that the circuit
can support 1.2 Gb/s/pin signaling while dissipating only 3.3 mW, with a 1.4 V supply. A CGaAs
delay-locked-loop (DLL) has been designed to explore the effects of low supply voltage and headroom on phase
noise performance. Simulations indicate that the DLL would operate at 500 MHz, with a peak jitter of 88 ps.

A CGaAs PLL clock generator was also designed and tested in this project. It operated at up to
800 MHz with a 1.5 V supply and 120 ps phase jitter. The CGaAs design was operational at a supply
voltage as low as 0.8 V. A test MCM and CGaAs driver and receiver chips were designed for use with this
PLL, to evaluate MCM signal integrity with low-voltage, high-edge-rate signals, and to test various driver
and receiver circuits. The MCM was fabricated at MicroModule Systems (MMS), through Midas. The
MCM included Mayo-designed passive test structures for measuring the MCM interconnect properties.
The PLL clock generator was designed to phase lock to a low-speed input clock and produce a
programmable multiple of this frequency for use as the GaAs microprocessor clock.

An accurate phase jitter simulation method was developed, which includes the phase jitter model
in transient simulations. Employing current-steering logic, we designed a low-noise
PLL clock generator in a 0.5 µm CMOS process. This design, which benefited from the availability of the jitter
simulator, was fabricated and tested. It achieves a top frequency of nearly 800 MHz with a power
supply voltage of 1.8 V, a measured absolute phase jitter of less than 60 ps, and an RMS cycle-to-cycle phase
jitter of 10 ps. This was the best phase jitter performance at that time, and it was achieved with low-voltage
techniques which will have direct applicability to future CMOS circuits.

Several  CAD  tools  were  developed  which  support  the  design  of  advanced  integrated  circuits  such 
as those from the PUMA project. A high-level optimization tool called GAIN (Genetic Algorithm on the
INternet) was developed to assist a designer in judiciously allocating resources and partitioning logic onto
chips  in  MCM  designs.  It  uses  a  genetic  algorithm  to  explore  permutations  of  a  baseline  architecture, 
spawning  trace-driven  simulation  jobs  on  a  network  of  workstations  so  that  many  options  can  be  evaluated 
in  parallel. 
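
The general approach, a genetic algorithm whose fitness evaluations run concurrently, can be sketched as below. This is not GAIN itself: the design space, the `simulate` scoring formula, and all parameter names are invented for illustration, and a local thread pool stands in for jobs spawned across a network of workstations.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical design space: each gene picks one option per parameter.
OPTIONS = {
    "icache_kb": [1, 2, 4, 8],
    "issue_width": [1, 2, 4],
    "rob_entries": [4, 8, 16],
}
KEYS = list(OPTIONS)

def simulate(config):
    """Toy stand-in for a trace-driven simulation; returns a score to maximize.

    The formula is invented: diminishing performance returns minus a
    linear resource cost.
    """
    perf = sum(v ** 0.5 for v in config.values())
    cost = 0.15 * sum(config.values())
    return perf - cost

def random_config(rng):
    return {k: rng.choice(OPTIONS[k]) for k in KEYS}

def crossover(a, b, rng):
    return {k: rng.choice([a[k], b[k]]) for k in KEYS}

def mutate(config, rng, rate=0.2):
    return {k: rng.choice(OPTIONS[k]) if rng.random() < rate else v
            for k, v in config.items()}

def evolve(generations=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            # Evaluate all candidates concurrently (threads stand in for
            # simulation jobs distributed over workstations).
            scores = list(pool.map(simulate, population))
            ranked = [c for _, c in sorted(zip(scores, population),
                                           key=lambda p: p[0], reverse=True)]
            parents = ranked[: pop_size // 2]          # keep the better half
            children = [mutate(crossover(rng.choice(parents),
                                         rng.choice(parents), rng), rng)
                        for _ in range(pop_size - len(parents))]
            population = parents + children
    return max(population, key=simulate)

best = evolve()
print(best, round(simulate(best), 3))
```

Keeping the better half of each generation unchanged means the best score found never decreases; a real tool would replace `simulate` with the trace-driven simulator and cache results, since each evaluation is expensive.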

Our subcontractor, Cascade Design Automation, developed a cell library migration tool called
MasterPort, which converts a GDSII input layout to compacted layout in a specified rule set. The tool
automatically generates the constraints and solves the constraint equations. It was used in the development of
more than 120 cells for test chips designed in the project, and was very helpful in keeping the cells updated
in the rapidly evolving CGaAs process. Cascade also developed an area-distributed pad router, called
Eggo, which worked with existing placement tools to minimize power and signal routing between the array
of bumps on the surface of a chip and the modules to which they are connected.

TEMPO is a transistor-level micro-placement tool for two-dimensional cell synthesis. It generates
custom-quality layouts for such high-performance logic families as cascode voltage switch logic, pass
transistor logic, and domino CMOS. This is achieved through powerful transformations such as dynamic
geometry sharing through transistor chaining and arbitrary geometry merging. TEMPO enables the quick
migration of cell libraries to new fabrication processes.

A constructive logic synthesis tool, called M31, was developed to interleave the traditionally
separate technology-independent logic restructuring and technology-dependent library binding stages of circuit
synthesis. M31 is based on a Boolean decomposition strategy that ties together 1) the structural properties
of the functions being synthesized, 2) the structural attributes of the implementation network, and 3) the
functional content of the target library. The resulting implementations are consistently smaller and faster
than those generated using conventional logic synthesis. In addition, they can be incrementally modified to
create variants that achieve other area/speed trade-offs.

A methodology and tools for minimizing the effects of capacitively-coupled crosstalk were also
developed. By using an accurate and consistent empirical model for wiring resources and constraints,
coupled noise and delay were made predictable, and thus avoidable. A congestion-driven placement
algorithm was developed to help minimize the incidence of capacitive coupling, and a global route-embedder
was developed to guide the detailed router to meet timing and noise constraints.

Papers  on  each  of  these  topics  are  included  in  the  list  of  manuscripts  attached.  Presentations  and 
project  details  can  be  found  at  http://www.eecs.umich.edu/UMichMP/. 

Conclusions 

Many of the characteristics of CGaAs make it an ideal technology for space-based applications.
Unlike other GaAs technologies, it has a p-transistor, which facilitates efficient on-chip memory, and
provides most of the other benefits of CMOS. Like other GaAs technologies, CGaAs is generations behind
CMOS in scaling, which means that it cannot compete with CMOS for speed. The power-delay product of
CGaAs, though, is extremely good compared to a similar generation of CMOS, and its radiation hardness
is superb. CGaAs devices do have more gate and drain leakage than CMOS. In most respects, CGaAs
scales well; there is no gate oxide to scale, the uniformity of which will be a serious challenge for CMOS
below a certain thickness. On the other hand, making source and drain contact to the epitaxial material is
difficult, so scaling this contact area is a challenge. Finally, the process control was neither tight enough
nor well enough defined to yield fully functional circuits. Nevertheless, the PUMA project generated a
number of useful contributions in computer architecture, circuit design, packaging and CAD tools.